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ABSTRACT 

This  article  deals  with  the  problem  of  factoring  a  large  sparse  positive  definite  matrix  on  a 
multiprocessor  system.  The  processors  are  assumed  to  have  substantial  local  memory  but 
no  globally  shared  memory.  They  communicate  among  themselves  and  with  a  host 
processor  through  message  passing.  Our  primary  interest  is  in  designing  an  algorithm 
which  exploits  parallelism,  rather  than  in  exploiting  features  of  the  underlying  topology  of 
the  hardware.  However,  part  of  our  study  is  aimed  at  determining,  for  certain  sparse 
matrix  problems,  whether  hardware  based  on  the  binary  hypercube  topology  adequately 
supports  the  communication  requirements  for  such  problems.  Numerical  results  from 
experiments  running  on  a  multiprocessor  simulator  are  included. 
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1.  Introduction 

This  article  deals  with  the  problem  of  factoring  a  large  sparse  positive  definite  matrix 
A  on  i  multiprocessor  system.  It  is  assumed  that  the  system  supports  message  passing 
among  individual  processors,  and  that  each  processor  has  a  substantial  amount  of  local 
memory.  We  assume  also  that  there  is  no  globally  shared  memory.  These  assumptions 
are  appropriate  for  a  number  of  recent  commercially  available  machines,  such  as  the 
binary  hypercube  multiprocessors  marketed  by  Ametek.  Intel  and  NCUBE  corporations.  In 
[8],  a  parallel  algorithm  was  developed  for  solving  dense  positive  definite  systems  on  such 
machines,  so  this  article  can  be  regarded  as  a  sequel  to  that  work,  in  which  the  sparsity  of 
the  problem  is  addressed  and  exploited. 

The  process  of  solving  large  sparse  positive  definite  systems  typically  involves  four 
distinct  steps: 

1.  C Ordering )  Find  a  good  ordering  P  for  A  .  That  is.  a  permutation  matrix  P  so 

that  PAPt  has  a  sparse  Cholesky  factor  L .  This  is  usually  referred  to  as  the 
ordering  problem . 

2.  ( Symbolic  factorization)  Determine  the  structure  of  the  Cholesky  factor  L  of 
PAPt  .  and  set  up  a  data  structure  for  this  factor. 

3.  ( Numerical  factorization )  Place  the  elements  of  A  into  the  data  structure,  and 

then  compute  L . 

4.  ( Triangular  solution)  Using  the  computed  L  .  solve  the  triangular  systems  Ly  -Pb  . 
LT  z-y .  and  then  set  x  -PT z . 

The  problems  of  implementing  an  ordering  algorithm  and  performing  the  symbolic 
factorization  procedure  on  a  multiprocessor  machine  are  major  projects  that  will  be 
considered  in  a  subsequent  article.  In  this  paper  we  develop  and  test  a  parallel  algorithm 
for  step  3  only. 

Before  proceeding  with  the  description  and  details  of  the  algorithm,  some  general 
remarks  about  the  design  and  implementation  of  parallel  algorithms  should  be  made. 
First,  it  should  be  kept  in  mind  that  the  objective  is  speed-up.  That  is,  given  a  p -processor 
machine,  we  would  like  to  solve  our  problem  in  time  that  is  as  close  as  possible  to  a  factor 
of  p  less  than  that  needed  to  solve  the  same  problem  on  a  single  processor  version  of  the 
machine,  using  the  best  serial  algorithm  available.  Of  course  in  the  latter  case  we  assume 
that  the  single  processor  machine  has  adequate  memory,  presumably  much  more  than  the 
amount  available  to  a  single  processor  in  the  multiprocessor  configuration. 

There  is  a  tendency  to  focus  on  processor  utilization  in  studying  parallel  algorithms. 
However,  while  high  processor  utilization  is  a  necessary  condition  for  good  speed-up,  it  is 
clearly  not  sufficient:  the  processors  have  to  be  doing  useful  work.  Thus,  in  order  to 
achieve  our  objective,  it  is  necessary  to  be  able  to  distribute  the  computation 
approximately  uniformly  across  the  processors,  to  identify  sufficient  parallelism  so  that 
most  of  the  computations  can  be  performed  simultaneously,  and  to  reduce  the  amount  of 
communication  among  the  processors. 

Let  us  assume  that  we  are  able  to  achieve  this  distribution.  Except  in  unusual 
circumstances,  some  communication  among  the  processors  will  be  required  during  the 
computation.  This  leads  us  to  an  important  point  about  communication  traffic. 
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Ideally,  every  processor  in  the  system  should  be  able  to  send  a  message  directly  to 
any  other  processor.  However,  for  large  p  .  economics  make  building  machines  with  such  a 
capability  infeasible,  so  most  local-memory  multiprocessors  provide  physical 
communication  links  among  only  a  few  nearest  neighbors  in  some  geometric  layout. 
(Common  topologies  include  the  ring,  the  two-dimensional  regular  grid  and  the  binary 
hypercube.)  A  consequence  is  that  a  message  to  be  sent  from  processor  i  to  processor  j 
may  have  to  traverse  several  physical  links,  and  be  forwarded  by  processors  along  the 
transmission  path. 

It  is  therefore  useful  to  distinguish  between  logical  and  physical  data  traffic.  By  the 
logical  traffic  from  processor  i  to  processor  j .  we  mean  the  amount  of  data  originated 
from  processor  i  that  must  be  received  and  utilized  by  processor  j .  On  the  other  hand, 
we  use  physical  traffic  from  i  to  j  to  refer  to  the  total  amount  of  data  traffic  that  actually 
flows  on  the  physical  link  (assuming  it  exists)  from  processor  i  to  j  in  the  multiprocessor 
network.  If  there  is  no  direct  link  between  processors  i  and  j .  the  amount  of  physical 
traffic  will  always  be  zero  even  if  there  is  some  logical  data  traffic  between  them.  In  this 
case,  data  originated  from  processor  i  and  required  by  processor  j  has  to  travel  through 
one  or  more  intermediate  processors  in  some  transmission  path  before  reaching  j  . 

It  is  clear  that  logical  traffic  is  determined  by  the  way  in  which  the  total  computation 
has  been  distributed  across  the  processors,  and  physical  traffic  further  depends  on  the 
underlying  hardware  topology  and  routing  strategies.  Loosely  speaking,  logical  traffic  is  a 
function  of  the  algorithm  only,  while  physical  traffic  is  a  function  of  both  the  algorithm 
and  the  hardware. 

An  outline  of  the  paper  is  as  follows.  In  Section  2,  we  review  the  basic  Cholesky 
algorithm  for  the  dense  matrix  case,  and  examine  the  effect  of  ordering  for  the  sparse  case. 
Although  we  defer  the  discussion  of  a  parallel  algorithm  for  computing  an  ordering  until  a 
later  paper,  the  choice  of  the  ordering  can  have  a  drastic  effect  on  both  the  sparsity  of  the 
triangular  factor  and  the  degree  of  parallelism  that  can  be  exploited  in  computing  it.  For 
the  numerical  experiments  reported  in  this  paper,  the  ordering  is  computed  by  a  standard 
sequential  algorithm.  The  design  and  implementation  of  the  parallel  algorithm  for  sparse 
numerical  factorization  are  presented  in  Section  3,  and  the  results  of  our  numerical 
experiments  in  Section  4. 

2.  Sparse  Cholesky  Factorization 

2.1.  Dense  Case:  the  Basic  Algorithm 

We  begin  by  providing  a  column-oriented  version  of  the  basic  Cholesky  factorization 
algorithm,  described  in  the  following  algorithmic  form. 
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for  y  >  1  to  n  do 
begin 

for  k  1  to  y  —  1  do 
for  i  j  to  n  do 

aij  >  —  a*  ***)* 

ajj  ^a)J 

for  k  y  +1  to  n  do 

akJ  akJ  !  aJ) 

end 


It  is  shown  in  [9]  that  this  form  of  Cholesky  factorization,  the  so-called  column- 
Cholesky  formulation,  is  particularly  well  suited  to  medium-  to  coarse-grain  parallel 
implementation.  It  was  found  to  have  the  best  combination  of  work-load  balance  and 
overlapped  execution  in  the  outer  loop  sub-tasks.  This  version  is  implemented  for 
shared-memory  multiprocessors  in  [9],  and  for  various  local-memory  architectures 
supporting  message  passing  in  [8.13]. 

Following  [9],  we  let  Tcol(j)  be  the  task  that  computes  the  y-th  column  of  the 
Cholesky  factor.  Each  such  task  consists  of  the  following  two  types  of  subtasks: 

1.  cmod  (y  X  )  :  modification  of  column  y  by  column  k  (.k  <y  ); 

2.  cdiv(j)  :  division  of  column  y  by  a  scalar. 

Thus,  in  terms  of  these  sub-tasks,  the  basic  algorithm  can  be  expressed  in  the 
following  condensed  form. 

for  y  1  to  n  do 
begin 

for  k  >  1  to  j  —  1  do 
cmod  (y  £  ) 

cdiv  (y  ) 

end 


We  now  consider  the  potential  for  parallelism  in  the  above  formulation  of  the 
algorithm.  We  implicitly  assume  throughout  this  paper  that  the  cmod  and  cdiv  operations 
are  atomic  in  the  sense  that  we  do  not  attempt  to  exploit  parallelism  within  them, 
although  such  exploitation  is  clearly  possible. 

Note  first  that  cdiv{j )  cannot  begin  until  cmod  (y  Jc)  has  been  completed  for  all 
k  < y ,  and  column  y  can  be  used  to  modify  subsequent  columns  only  after  cdiv(j)  has 
been  completed.  However,  there  is  no  restriction  on  the  order  in  which  the  cmod 
operations  are  executed,  and  cmod  operations  for  different  columns  can  be  performed 
concurrently.  For  example,  after  c<fiv(l)  has  completed,  cmod  ( 2,1)  and  cmod  (3.1)  could 
execute  in  parallel.  These  precedence  relations  are  depicted  in  Fig.  1. 
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cmod(j+lJ)  cmod{j+2,j) 


cmod(n  ,j) 


cmod(j, 


cmod(j,j— 1) 


Fig.  1:  Subtask  precedence  graph  for  column-Cholesky. 


2.2.  Parallel  Sparse  Column-Cholesky  and  the  Effect  of  Ordering 

The  main  difference  between  the  sparse  and  dense  versions  of  the  algorithm  stems 
from  the  fact  that  for  sparse  A  .  column  j  may  no  longer  need  to  be  modified  by  all 
columns  k  <  j .  Specifically,  column  j  is  modified  only  by  columns  k  for  which  ljk  ^O, 
and  after  cdiv(y')  has  been  executed,  column  j  needs  to  be  made  available  only  to  tasks 
Tcolir')  for  which  ^0-  This  can  be  understood  easily  by  examining  the  basic  form  of 
the  algorithm  displayed  at  the  beginning  of  section  2.1.  If  aJk  =  0.  it  is  obviously 
unnecessary  to  execute  the  loop  on  i .  since  it  has  no  effect. 

Ideally,  we  would  like  to  choose  an  ordering  for  the  matrix  A  which  achieves  a 
number  of  objectives.  First,  just  as  in  the  use  of  serial  machines,  we  would  like  to 
preserve  sparsity  and  obtain  a  low  arithmetic  operation  count.  In  addition,  the  ordering 
should  allow  a  high  degree  of  parallelism,  and  allow  the  distribution  of  the  computation 
across  the  processors  in  a  way  that  allows  the  parallelism  to  be  exploited  without 
requiring  an  inordinate  amount  of  communication. 

Fortunately,  these  objectives  turn  out  to  be  mutually  complementary.  In  order  to 
gain  insight  into  this  problem,  it  is  useful  to  introduce  the  notion  of  elimination  trees  for 
sparse  Cholesky  factors  [3.16]. 

Consider  the  structure  of  the  Cholesky  factor  L .  For  each  column  j  ^rt ,  if  column 
j  has  off-diagonal  nonzeros,  define  y [j  ]  by 

y[j]  =  min  {  i  I  ti}  ^0,  i  >  j  }  : 

that  is,  y[j  ]  is  the  row  subscript  of  the  first  off-diagonal  nonzero  in  column  j  of  L .  If 
column  j  has  no  off-diagonal  nonzero,  we  set  y [j  \  ~  j  ■  (Hence  y[n  ]=n  .) 

We  now  define  an  eliminatUm  tree  corresponding  to  the  structure  of  L .  The  tree  has 
n  nodes,  labelled  from  1  to  n  .  lor  each  j  .  if  y[j]  >j,  then  node  y[j  ]  is  the  parent  of 
node  j  in  the  elimination  tree,  and  node  j  is  one  of  possibly  several  child  nodes  of  node 
y [/].  We  assume  that  the  matrix  A  is  irreducible,  so  that  n  is  the  only  node  with 


y[j  ]  =  j  and  it  is  the  root  of  the  tree.  Thus,  for  1  ^  j  <  n  .  y[j  ]  >  j .  (If  A  is  reducible, 
then  the  elimination  tree  defined  above  is  actually  a  forest  which  consists  of  several  trees.) 
There  is  exactly  one  path  from  each  node  to  the  root  of  the  tree.  If  node  t  lies  on  the  path 
from  node  j  to  the  root,  then  node  i  is  an  ancestor  of  node  j .  and  node  j  is  a  descendant 
of  node  i . 

An  example  to  illustrate  the  notion  of  elimination  trees  is  provided  by  the  structure 
of  the  Cholesky  factor  shown  in  Fig.  2.  with  the  associated  elimination  tree  being  shown  in 
Fig.  3.  Elimination  trees  have  been  used  either  implicitly  or  explicitly  in  numerous  articles 
dealing  with  sparse  symmetric  factorization  [1.2,3,6,7.12,14,16.17,18,19.20].  In  particular, 
the  paper  [18]  uses  the  elimination  tree  as  a  model  to  study  the  parallel  sparse  Cholesky 
factorization  algorithm  in  a  shared-memory  multiprocessor.  In  addition.  Duff  [2]  is 
exploring  the  use  of  elimination  trees  in  the  parallel  implementation  of  multifrontal 
methods. 

x 

x 

X  X 
X  X 

X  X  X  X  X 
X  XXX 

Fig.  2:  Structure  of  a  Cholesky  factor. 


Fig.  3:  The  elimination  tree  associated  with  the  Cholesky  factor  in  Fig.  2. 

The  elimination  tree  provides  precise  information  about  the  column  dependencies. 
Specifically,  cdiv(i)  cannot  be  executed  until  cdiv(j)  has  completed  for  all  descendant 
nodes  j  of  node  i . 

The  elimination  tree  has  simple  structure  that  can  be  economically  represented  using 
y.  as  shown  in  Fig.  4.  Thus,  the  representation  requires  only  a  single  vector  of  size  n  . 
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y[j] 

3  5  4  5  6  6 

Fig.  4:  Computer  representation  of  the  tree  of  Fig.  3. 

In  order  to  see  the  role  that  elimination  trees  might  play  in  identifying  parallelism, 
we  now  consider  two  different  orderings  of  the  same  problem,  and  study  their 
corresponding  elimination  trees.  Consider  a  3  by  3  grid  problem,  where  the  9  vertices  of 
the  grid  are  numbered  in  some  manner,  and  the  associated  matrix  A  has  the  property  that 
dij^O  if  and  only  if  vertex  t  and  vertex  j  are  associated  with  the  same  small  square  in 
the  grid.  Two  different  orderings  of  the  grid  are  given  in  Fig.  5.  the  associated  Cholesky 
factors  are  displayed  in  Fig.  6,  and  their  corresponding  elimination  trees  are  shown  in  Fig. 
7. 
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Fig.  6:  Structure  of  the  Ch.ilesky  factors  for  the  orderings  of  Fig.  5. 
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Fig.  7:  The  elimination  trees  associated  with  the  matrices  in  Fig.  6. 

The  elimination  tree  on  the  left  is  typical  of  those  generated  by  orderings  that  are 
good  in  the  sense  of  yielding  low  fill  and  low  operation  counts.  Its  tree  structure  is  short 
and  wide,  and  such  trees  and  their  associated  orderings  lend  themselves  well  to  parallel 
computation.  For  example,  it  should  be  clear  that  Tcoli  1).  Tcolil).  Ted  (3).  and  Tcoli 4) 
can  start  immediately  in  parallel.  Moreover,  when  they  have  completed  execution,  Tcoli 5) 
and  Tcdi 6)  may  proceed  independently.  The  remaining  tasks  are  no  different  than  those 
for  a  dense  matrix,  and  the  findings  in  [8]  apply  equally  well  here. 

On  the  other  hand,  the  band-oriented  ordering  shown  above  is  undesirable  because  it 
imposes  the  same  serial  execution  on  the  ediv  operations  that  is  imposed  in  the  dense  case 
(note,  however,  that  even  in  the  dense  case,  some  emod  operations  can  still  be  carried  out 
concurrently  [9]).  Moreover,  the  operation  counts  and  fill-in  are  inferior  to  that  of  the 
first  ordering. 

In  the  elimination  tree,  if  node  i  and  node  j  belong  to  the  same  level  of  the  tree,  it  is 
clear  that  the  tasks  Tcdii  )  and  Tcoli  j)  can  be  performed  independently  so  long  as  the 
tasks  associated  with  their  descendant  nodes  have  all  been  completed.  In  order  to  gain 
high  processor  utilization,  it  is  therefore  desirable  to  assign,  if  possible,  nodes  on  the  same 
level  of  the  tree  to  different  processors.  An  overall  task  assignment  scheme  will  then 
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correspond  to  assigning  the  Tcolii )  tasks  to  successive  processors  in  a  breadth-first 
bottom-up  manner  from  nodes  of  the  elimination  tree. 

It  should  be  pointed  out  that  some  of  the  practical  fill-reducing  orderings  will 
already  order  the  nodes  of  the  elimination  tree  in  this  desirable  sequence.  They  include 
the  recent  implementation  of  the  minimum  degree  ordering  using  multiple  elimination  [15] 
and  a  version  of  the  nested  dissection  ordering  [  10].  In  such  cases,  the  task  assignment 
scheme  corresponds  to  the  straightforward  wrap-around  assignment,  where  task  Ted  (i ) 
will  be  assigned  to  the  processor  s  .  given  by  s  —  (i  —  1)  mod  p  . 

3.  Design  and  Implementation 

In  this  section,  we  consider  the  design  and  implementation  of  a  sparse  Cholesky 
factorization  algorithm  appropriate  for  a  parallel  multiprocessor  with  local  memory.  Let 
A  be  the  given  n  by  n  sparse  symmetric  positive  definite  matrix  with  Cholesky  factor  L . 
We  assume  that  the  matrix  has  already  been  permuted  by  some  fill-reducing  ordering 
appropriate  for  parallel  elimination. 

As  before,  we  let  Tcolij )  be  the  task  of  computing  the  j  -th  column  of  the  sparse 
Cholesky  factor  L .  This  task  consists  of  the  two  types  of  subtasks:  emodij  ,k )  and 
cdiv{j  ). 

In  the  sparse  case,  the  task  Tcdij)  can  be  expressed  in  the  following  algorithmic 
form: 


for  each  k  with  nonzero  Iji  and  j  >  k  do 
emod  (  j  Jc  ) 
ediv  (  j ) 

It  should  be  clear  that  the  number  of  emod  operations  required  in  the  task  Tcolij)  is 
given  by  the  number  of  off-diagonal  nonzeros  in  the  / -th  row  of  L.  To  facilitate  our 
discussion,  we  introduce  the  vector  nmod[ *].  where  the  value  nmod[j]  is  the  number  of 
column  modifications  emod  required  in  the  execution  of  Ted  ( j ).  This  vector  can  be 
obtained  by  simply  counting  the  number  of  off-diagonal  nonzeros  in  each  row  of  L  . 

Consider  the  symmetric  factorization  of  A  in  a  given  parallel  message-passing 
multiprocessor  environment.  Let  p  be  the  number  of  processors  in  the  parallel  machine. 

We  assume  that  an  assignment  of  the  column  tasks  Tcol{*)  to  the  computational  nodes  of  ■ 

the  multiprocessor  has  been  given.  For  definiteness,  let  map[ «]  be  the  mapping  of  these  n  ] 

tasks  into  the  p  processors.  That  is.  map[j]  will  be  the  processor  that  is  responsible  for  ' 

the  performance  of  the  task  Tcd(j).  and  hence  the  computation  of  column  j  of  L .  It  ' 

should  be  pointed  out  that  the  effect  of  task-to-processor  assignment  on  load  balancing  1 

and  communication  cost  can  be  studied  by  choosing  different  map[*]  functions.  j 

In  the  parallel  environment,  we  further  assume  that  there  are  two  primitives:  send  ? 

and  await  .  Execution  of  a  send  does  not  cause  the  sending  process  to  wait  for  a  reply.  On  J 

the  other  hand,  execution  of  an  await  causes  the  process  executing  it  to  be  suspended  until  J 

the  message  is  received.  Messages  that  arrive  at  the  destination  process  before  the 

execution  of  the  receiving  await  are  placed  in  a  queue  until  needed.  j 

We  shall  now  describe,  in  an  algorithmic  form,  the  work  to  be  performed  by  the  host  ' 

and  node  processors.  Each  node  processor  uses  a  multisend  routine,  which  will  be 

discussed  later  in  detail.  ‘ 


HOST  processor: 

Determine  the  mapping  function  map[*] 
for  s  >  1  to  p  do  /*  broadcast  map  [*  ]  V 
send  map  [*  ]  to  processor  s 

Determine  the  nmod  [*  ]  function 
for  j  :«■  1  to  n  do 

send  column  j  of  A  and  nmod  [ j  ]  to  processor  map  [j  ] 
repeat  n  times  do 

await  a  column  of  L  and  store  it  into  the  data  structure 


NODE  processor  s : 

await  map  [*  ]  from  the  host 

compute  ncol  (using  map  ).  the  number  of  columns  to  be  processed  by  processor  s 

/*  obtain  columns  from  the  host  and  eliminate  if  possible  */ 

repeat  need  times  do 

begin 

await  a  column  j  of  A  and  nmod  [ j  ]  from  the  host 

if  nmod[j ]  -  0  then 

begin 

edivij  ) 

multisend  ( j  ,  L,j  ) 

end 

end 

need  >  need  —  number  of  columns  received  with  zero  nmod 

/*  main  loop:  driven  by  the  incoming  columns  */ 

while  need  >  0  do 

begin 

await  a  column  of  L  ,  say  L.t 

for  each  offdiagonal  nonzero  ljk  with  map[j  ]  =  s  do 
begin 

emod  ( j  &  ) 

nmod  [  j  ]  :-  nmod  [  j  ]— 1 
if  nmod  [j  ]  -  0  then 
begin 

ediv  (y  ) 

multisend  (.j .  L.j) 
need  need—  1 
end 

end 

end 

It  is  clear  that  the  host  processor  is  merely  responsible  for  the  initiation  of  the  tasks  by 
sending  the  relevant  information  to  each  node  processor,  and  then  for  the  collection  of  the 


computed  columns  of  the  factor  matrix  L .  In  each  node  processor,  a  routine  called 
multisend  is  used.  Its  function  is  to  send  the  column  L,}  to  the  host  processor  and  also  to 
all  the  node  processors  that  require  this  column  for  performing  modifications.  Specifically, 
this  routine  can  be  formulated  as  follows. 

Subroutine  multisend  (y  .  L.}  ): 

for  each  processor  d  such  that  for  some  i  >  j  .  l,t  ^  0  and  map[i  ]  =  d  do 
send  L,j  to  processor  d 

send  L.j  to  the  host 

It  should  be  emphasized  that  the  routine  multisend  should  only  send  one  copy  of  the 
column  L,j  to  a  processor  even  though  the  processor  may  use  this  column  to  modify  more 
than  one  column  in  this  processor.  Furthermore,  the  routing  strategy  in  the  distribution 
of  the  column  L*}  to  the  processors  concerned  can  be  changed  by  simply  coding  a  new 
version  of  multisend . 

There  are  a  few  points  worth  mentioning  in  the  scheme  for  each  node  processor.  As 
soon  as  a  column  L,}  of  L  is  completely  formed,  it  is  immediately  sent  to  the  other 
processors  that  need  this  column,  including  the  node  processor  that  computed  L.j  if  that 
node  processor  also  needs  L.j  .  A  node’s  sending  messages  to  itself  in  such  circumstances 
makes  the  logic  and  programming  much  cleaner.  This  should  not  result  in  a  significant 
performance  penalty  in  any  reasonable  multiprocessor  design  since  it  should  involve 
merely  an  internal  movement  of  data.  The  immediate  transmission  of  completely  formed 
columns  allows  an  overlapping  of  column  elimination  and  column  input  from  the  host  in 
the  repeat  loop  in  the  algorithm.  More  importantly,  by  making  columns  of  L 
immediately  available,  this  will  reduce  wait  time  on  node  processors. 

Note  also  that  the  main  loop  is  driven  by  the  incoming  columns  of  L .  This  implies 
that  the  parallel  algorithm  is  working  at  the  granularity  level  of  the  subtasks  cmorf(y  ) 
and  cdivij ).  rather  than  at  the  level  of  the  tasks  Tcol  (y  ).  This  is  in  direct  contrast  to  the 
serial  implementation  of  the  sparse  Cholesky  method  (for  example.  SPARSPAK  [  1 1  ]  or 
YSMP  [5]).  where  each  Tcolij  )  is  executed  and  completed  in  succession. 

Another  important  characteristic  of  this  formulation  is  that  it  is  independent  of  the 
interconnection  network  topology.  In  other  words,  the  parallel  algorithm  as  formulated  is 
applicable  to  any  parallel  multiprocessor  in  a  message-passing  environment.  For  different 
processor  interconnections,  it  may  be  desirable  to  choose  a  different  task-to-processor 
mapping  function  map[*  \  or  a  different  message  routing  strategy.  But  the  basic  algorithm 
remains  unchanged. 

4.  Experiments  and  Conclusions 

In  the  previous  sections  our  discussion  has  been  independent  of  the  interconnection 
topology  of  the  multiprocessor.  Our  objective  has  been  to  distribute  the  workload 
uniformly  and  to  reduce  the  amount  of  communication  that  must  be  performed.  In  this 
section  we  report  some  experimental  results  obtained  from  an  implementation  of  our 
algorithm  running  on  a  binary  hypercube  multiprocessor.  For  background  information 
about  hypercube  multiprocessors,  see  [8]  and  the  references  contained  therein. 


In  order  to  test  our  implementation,  and  to  gain  some  information  on  communication 
traffic,  we  solved  some  finite  element  problems  derived  from  a  sequence  of  L-shaped 
triangular  meshes  described  in  [10].  The  ordering  used  for  these  problems  was  an 
automatic  nested  dissection  ordering  produced  by  the  algorithm  described  in  [10],  The 
Tcol  (i )  tasks  were  assigned  to  the  processors  in  a  simple  serial  wrap-around  manner,  with 
no  account  whatsoever  being  taken  of  the  underlying  topology  of  the  hypercube 
multiprocessor.  Both  the  ordering  and  the  symbolic  factorization  phases  were  done  in 
serial  mode.  Parallel  versions  of  these  algorithms  are  under  development. 

Our  experiments  were  conducted  using  a  binary  hypercube  simulator  written  by  T. 
H.  Dunigan  of  the  Oak  Ridge  National  Laboratory.  For  details  about  the  simulator,  see 
[4]. 

Statistics  on  both  the  logical  and  physical  communication  for  one  of  the  problems 
were  collected,  as  shown  in  the  tables  that  follow.  The  results  reported  are  typical  of 
those  found  in  experiments  for  other  problems  in  the  set  of  nine  problems  in  [10].  The 
entry  in  row  r  and  column  c  of  each  table  is  the  amount  of  data  traffic  from  the  processor 
corresponding  to  row  r  to  the  one  corresponding  to  column  c.  Thus,  the  entries  in  the  last 
row  of  the  tables  represent  traffic  from  the  host  processor  to  the  individual  node 
processors. 

We  have  included  both  communication  counts  and  volume  in  the  statistics. 
Communication  count  simply  refers  to  the  number  of  messages  sent.  Note  that  a  message 
associated  with  the  nonzeros  of  a  column  includes  the  number  of  nonzeros,  the  subscript 
information  and  the  actual  nonzero  values.  The  volumes  reported  are  the  total  number  of 
bytes  transmitted.  In  the  experiments,  an  integer  requires  4  bytes,  and  a  floating  point 
number  requires  8  bytes. 
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Table  1:  Logical  communication  counts  among  8  processors  for  n  =1009. 
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Host  42224  40832  41204  41328  41036  40472  41348  41480 


Table  2:  Logical  communication  volume  among  8  processors  for  n  -1009. 


Table  3:  Physical  communication  counts  for  8  processors  and  n-1009. 
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0 

Table  4:  Physical  communication  volume  among  8  processors  for  n  -1009. 


There  are  several  noteworthy  aspects  of  the  numbers  in  Tables  1-4.  First,  observe 
that  the  logical  communication  is  quite  evenly  distributed  among  all  the  processors.  That 
is.  the  algorithm  generates  about  the  same  amount  of  traffic  between  any  and  every  pair  of 
processors. 

Entries  in  the  logical  communication  tables  associated  with  the  processor  nodes  are  all 
nonzero.  However,  there  are  a  number  of  zero  entries  in  the  physical  communication  table. 
Indeed,  each  zero  in  the  tables  (except  for  the  "Host"  row  and  column)  means  that  a 
physical  link  does  not  exist  between  the  two  associated  processors.  For  example,  there  is 
no  direct  link  between  processors  0  and  3.  The  messages  from  processor  0  to  3  must  be 
directed  through  an  intermediate  processor,  processor  1.  This  will  have  the  effect  of 
increasing  the  physical  traffic  from  processor  0  to  1  and  from  processor  1  to  3.  This 
explains  why  the  nonzero  entries  in  the  physical  communication  tables  are  much  larger 
than  the  corresponding  entries  in  the  logical  communication  tables. 

Furthermore,  it  is  interesting  to  observe  that  the  actual  physical  links  in  the 
hypercube  topology  all  carry  about  the  same  amount  of  traffic.  Thus,  it  would  appear  that 
this  particular  topology  adequately  supports  the  actual  (logical)  traffic  generated  by  the 
algorithm,  at  least  for  this  class  of  sparse  problems. 

In  order  to  determine  what  our  implementation  achieved  in  actual  speed-up,  we  ran 
our  code  using  one  processor  and  eight  processors,  and  in  addition  we  ran  the  best  serial 
code  we  have  available. 

A  comparison  of  the  times  for  the  serial  code  and  the  parallel  code  with  one  processor 
was  done  to  assess  the  cost  incurred  in  the  parallel  implementation  per  se.  It  is 
noteworthy  that  the  penalty  is  quite  substantial,  in  the  neighborhood  of  25  percent.  This 
is  different  from  experience  with  solving  dense  systems  on  multiprocessors,  where  the 
performance  of  the  best  serial  code  and  the  parallel  code  running  on  one  processor  are 
comparable  [8].  This  is  to  be  expected  for  the  dense  case,  since  the  parts  of  the  codes 
where  the  majority  of  the  computation  is  done  are  identical.  However,  serial  codes  for 
sparse  Cholesky  factorization  gain  important  performance  advantages  through  heavy  use 
of  context.  For  example,  efficient  processing  and  storage  of  a  column  depend  on  rapid  and 
direct  access  to  information  about  certain  selected  previous  columns.  This  context  is 


inevitably  lost  in  a  parallel  implementation,  since  the  columns  are  distributed  among 
many  processors,  and  the  use  of  such  context  would  almost  certainly  require  prohibitive 
amounts  of  communication.  Thus,  the  data  structures  and  computational  schemes  used  in 
the  serial  and  parallel  implementations  are  quite  different. 

Another  aspect  of  parallel  sparse  matrix  computations  that  tends  to  make  them  less 
efficient  than  their  dense  counterparts  is  that  the  associated  messages  in  sparse  parallel 
implementations  tend  to  be  shorter.  The  time  required  to  transmit  a  message  from  one 
processor  to  another  typically  involves  a  fixed  startup  time  plus  a  cost  proportional  to  the 
message  length.  It  is  therefore  desirable  to  have  a  few  large  messages  rather  than  many 
small  ones  in  parallel  computations,  but  this  is  difficult  to  achieve  for  sparse  matrix 
computations. 

The  results  of  our  experiments  are  contained  in  Table  5.  Note  that  the  "time" 
reported  is  artificial.  The  simulator  measures  time  simply  as  the  number  of  machine 
instructions  executed,  with  no  distinction  being  made  between  the  relative  cost  of 
executing  instructions  of  different  types. 


RESULTS  ON  SPEED-UP 


n 

serial 

one  processor 

eight  processors 

time 

time 

speed-up 

time 

speed-up 

265 

719606 

1027215 

.70 

285614 

2.52 

406 

1462056 

2005731 

.73 

484443 

3.02 

577 

2567430 

3454271 

.74 

776278 

3.31 

778 

4022592 

5357658 

.75 

1120536 

3.59 

1009 

6112334 

8060091 

.76 

1591583 

3.84 

Table  5:  Speed-up  for  one  processor  and  8-processor  configurations. 
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